Abstract

After the phenomenal success of the PageRank algorithm, many researchers have extended
the PageRank approach to ranking graphs with structures richer than the simple linkage
structure. In some scenarios we have to deal with multi-parameter data, where each node has
additional features and there are relationships between such features.

This paper stems from the need for a systematic approach to multi-parameter
data. We propose models and ranking algorithms which can be used with minor adjustments for
a large variety of networks (bibliographic data, patent data, Twitter and social data, healthcare
data). In this paper we focus on several aspects which have not been addressed in the literature:

(1) we propose different models for ranking multi-parameter data and a class of numerical
algorithms for efficiently computing the ranking score of such models; (2) by analyzing the
stability and convergence properties of the numerical schemes, we tune a fast and stable technique
for the ranking problem; (3) we consider the robustness of our models when data
are incomplete. The comparison of the rank on the incomplete data with the rank on the full
structure shows that our models compute consistent rankings, whose correlation is up to 60%
when just 10% of the links of the attributes are maintained, suggesting the suitability of our
models also when the data are incomplete.
1 Introduction

Ranking algorithms are essential tools for searching in large collections of data; without them
it would be extremely difficult to find the desired information. Following the introduction and
the success of PageRank and similar ranking algorithms [16], researchers have extended such
techniques to a multitude of domains.

In this paper we consider the setting in which the data consist of a collection of linked items, where
each item has a set of additional attributes (features). In this setting we assume that the rankings
of items with common attributes influence each other. Many important problems are instances
of this general framework. In bibliographic ranking, items are scholarly papers and their citations give
the linkage structure. The features associated with each paper are its authors, the journal where it
appears, its subject classification, and so on. In patent data, items are patents linked by citations
to older patents. With each patent we can associate inventors, firm, examiner, technologies, etc.
Other examples are social or Twitter graphs, where we have information about the status, the
geographical location, the education, etc., of users. In healthcare data we have patients, doctors,
treatments, diseases, etc. With a slight abuse of terminology, in the following we informally use the term
“multigraph” to denote this kind of relationship between items and features, while other authors
refer to this kind of graph as a heterogeneous information network.

In this paper we describe different models for representing the multigraph structure of a network,
and analyze different techniques for assigning weights to features and for using these weights in the
ranking process. These weights capture the importance that each link confers on the linked object.
We then build a fast and stable numerical method for computing the ranking score according to our 
models. The proposed algorithm is obtained by combining two non-stationary methods (BCGStab 
[21] and TFQMR [21]) and a final phase of iterative refinement. 

We perform many tests on two large datasets of patent data extracted from the US Patent
Office: the first dataset consists of all the patents granted in the period 1976-1990 (roughly 2.5
million patents), and the second of those issued between 1976 and 2012 (almost 8 million patents).
The experiments aim at understanding the role of the parameters involved in the algorithm and the 
differences between the various models while comparing the results with those returned by PageRank 
and the ordering induced by the citation count. 

We also briefly investigate the robustness of our models when data are incomplete and unrecoverable.
In this setting our goal is to use all the information available without favoring players
(items or features) with more complete data over those for which some information is missing.
We treat unknown values as zeroes, in the sense that we do not distinguish between missing (not
available) and absent (nonexistent) features. This choice is the simplest one and the one implemented
in patent repositories and in many citation databases such as Scopus and MathSciNet where, for example,
a citation is not attributed to anyone when the name of an author has been misspelled. To
evaluate the robustness of our ranking schemes on possibly incomplete data, following the approach
in the related literature, we randomly remove features from items with an assigned probability.
Our experiments show that, even when removing up to half of the features, the ranks provided by our
algorithm correlate highly with the ranks computed on the complete data. As expected, as more and
more features are removed, the ranks converge to the rank obtained using only the linkage structure.
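
The removal experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `drop_features`, the dictionary layout and the toy patent data are hypothetical and are not taken from the paper.

```python
import random

def drop_features(features, keep_prob, seed=0):
    """Keep each (item, attribute) link independently with probability keep_prob.

    `features` maps an item to its list of attributes; a removed link is simply
    absent afterwards, i.e. it is treated like a zero entry.
    """
    rng = random.Random(seed)
    return {item: [a for a in attrs if rng.random() < keep_prob]
            for item, attrs in features.items()}

# Hypothetical toy data: patent id -> list of inventors.
inventors = {"p1": ["ann", "bob"], "p2": ["bob"], "p3": ["cid", "ann", "dan"]}

# Keep roughly half of the inventor links.
thinned = drop_features(inventors, keep_prob=0.5)
```

Ranking the thinned data and correlating the result with the ranking of the full data then gives one point of the robustness curve.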

Finally, we tested the robustness of our models with respect to the granularity of the features.
For example, if we are dealing with bibliographic data we can group papers into subject classes, where
the granularity can be subject macro areas (Math, Computer Science, etc.) or finer classifications
(Algebra, Number Theory, Calculus, Algorithms, Data Bases, etc.). In this context it is desirable
that, when using a finer classification, the sum of the ranks of the subtopics of topic A is close to A’s rank
computed using the coarser classification. Experiments with the US patent dataset show that most
of our models have this desirable property.

The paper is organized as follows. In Section 1.1 we formally introduce the problem
considered in the paper; in Section 1.2 we motivate our study and connect the techniques and
the algorithm we propose with the existing literature. In Section 2 we briefly present some models,
discussing how extra information and features can be added to the citation structure to improve
ranking, and possible weighting criteria for such features. In our models the ranking is obtained
by approximating the Perron vector of a suitable stochastic matrix.

The following sections discuss, in order: different ways of approximating the Perron vector, showing
that it can be obtained as the solution of a linear system; different methods for the numerical
solution of such a linear system, together with the databases used for the experiments; and extensive
numerical tests comparing the different models in terms of convergence for missing data and
consistency for class aggregation. The final section contains the conclusions
and discusses possible improvements of the models.

1.1 Preliminaries and notations on multigraphs 

In this paper we consider a multigraph as described by a directed graph $G = (V, E)$ and two mapping
functions, one for the nodes $\tau : V \to \mathcal{A}$ and one for the edges $\phi : E \to \mathcal{R}$. Each node $v \in V$ belongs
to a particular type $\tau(v) \in \mathcal{A}$ and each edge $e \in E$ belongs to a particular type of relation $\phi(e) \in \mathcal{R}$.
Functions $\phi$ and $\tau$ are such that if $e_1$ and $e_2$ are two edges, $e_1 = (v_1, v_2)$ and $e_2 = (w_1, w_2)$, with
$\phi(e_1) = \phi(e_2)$, then $\tau(v_1) = \tau(w_1)$ and $\tau(v_2) = \tau(w_2)$. When $|\mathcal{A}| > 1$ and $|\mathcal{R}| > 1$ we say that the
graph is a multigraph.
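
The two conditions in this definition can be checked mechanically. The sketch below is illustrative only: the function names, the edge-triple representation and the toy patent data are assumptions made here, not notation from the paper.

```python
def is_consistent(node_type, edges):
    """Check that edges with the same relation type join the same pair of node types.

    node_type: dict node -> type (the map tau)
    edges:     list of (source, target, relation) triples (relation plays the role of phi)
    """
    signature = {}
    for u, v, rel in edges:
        pair = (node_type[u], node_type[v])
        # The first edge of a relation fixes its (source type, target type) pair.
        if signature.setdefault(rel, pair) != pair:
            return False
    return True

def is_multigraph(node_type, edges):
    """True when there is more than one node type and more than one relation type."""
    node_types = set(node_type.values())
    relations = {rel for _, _, rel in edges}
    return len(node_types) > 1 and len(relations) > 1

# Hypothetical toy patent graph.
node_type = {"p1": "Patent", "p2": "Patent", "f1": "Firm"}
edges = [("p1", "p2", "cites"), ("p1", "f1", "assigned_to")]
```

Here `is_consistent(node_type, edges)` holds, while adding the edge `("f1", "p2", "cites")` would violate the condition, because a "cites" edge would then start from a Firm node.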
A typical multigraph is a patent network, where each patent is associated with different features from
the set $\mathcal{A}$ = {Patent, Technology, Firm, Examiner, Inventor, Lawyer}. The different relation
types are the edges between patents and firms, patents and examiners, patents and the set of
inventors, and patents and lawyers, besides the edges to other cited patents. Each kind of
edge has a different semantic meaning: for example, the connection between inventors and patents
expresses intellectual property over the patent, while the edge between patent and examiner represents
the fact that a patent was granted by a particular examiner. Figure 1 shows the relations
between the different features and the different kinds of nodes.



Figure 1: Schema of a patent graph. Each patent has a relation with other types of nodes. 

To better understand multigraphs and the information contained in and expressed by the different
types of relations (edges) between nodes, we associate with the graph a model describing the interaction
between items and features and the possible interactions between features. In this paper we define several
models and compare them on the basis of a ranking function inspired by the PageRank algorithm.
In particular, the original network schema is enriched by including in the model other information
that can be derived from the relations between nodes, such as the network of co-inventors, or all the
combinations of any pair of features, e.g. firm-technology or examiner-inventor.
These enriched models allow us to define a ranking function mapping objects to a real non-negative
score representing the importance of the object. The rank accounts for all the information available,
not only for the citation network, and allows us to rank all types of nodes on the basis of the linkage
structure of the enriched graph.

1.2 Motivations and Related work 

Ranking algorithms are essential when searching in large collections of data, be they web pages,
bibliographic items or even healthcare data. Recently, many ranking algorithms have been developed
which take advantage of the specific structure of the underlying graph. Also in
the area of economics it is common practice to use ranking metrics for evaluating the performance of
markets and country economies. Recently, [53] has proposed a ranking algorithm based on PageRank
for patent data. Although all the information about patents is available from the USPO (US
Patent Office), only the citation structure has been considered in [53], and the multigraph structure of
the patent graph, including also information on firms, inventors and technologies, has not been fully
exploited. Since patents are often used to measure the innovation of entrepreneurial activities, a
ranking scheme taking into account all the features of patents can be used not only for evaluating the
innovation of the patented idea or product, but also to evaluate firms or for portfolio management.
This is the primary reason we tested the ranking algorithm presented in this paper on patent data,
even if the technique we present can be applied to any multigraph structure.

Comparing different ranking algorithms is a very difficult task since for this problem no ground
truth is available. In some cases it is possible to take a panel of volunteers and let them manually
evaluate the data, but in most cases, either for the size of the data or for the expertise required,
this is not possible. For example, manually ranking patents requires a remarkable knowledge of the
field and such expertise is not easy to find. Another difficulty in comparing ranking algorithms is
that we can use the same data to discover different properties: in this case a direct comparison is
not possible. For instance, if we want to evaluate scholars on the basis of their ability to work in
a team, we will design a ranking function highly valuing the scholars with many coauthors, while
if we are interested in personal scientific strength it is natural to normalize each publication by the
number of coauthors. The resulting rankings will be completely incommensurable.

In this paper we propose a tunable ranking algorithm where, by changing parameters, we can
accomplish different goals. In particular, the same algorithm can be used on different kinds of
data and for different purposes. One of the parameters is the model itself and another one is the
weighting strategy. This is the major difference from previous ranking algorithms, which are designed
for specific networks and appear to be less tunable. Together with the models we
propose and analyze some weighting strategies. To change the ranking function one can implement
other weighting strategies and incorporate them into the algorithm.

In the following we review other approaches for ranking multigraphs and compare them with
our strategy. The problem of ranking “multigraphs”, as informally defined above, has been recently
considered in some specific domains. In [25] the multigraph is transformed into a layered graph with
a layer for each feature. The ranks of each layer are computed independently and the final ranks are
obtained as a linear combination of the layer ranks. We believe the independent computation for
each layer does not fully take advantage of the structure of the problem. For the specific domain
of bibliographic ranking, the PopRank algorithm powering Microsoft Academic Search, introduced
by Nie et al., is a two-phase extension of PageRank applied to typed multigraphs with different
weights on the links. In particular, the formula for the PopRank score combines, with weight $\varepsilon$, the
so-called “web popularity”, which is a measure similar to PageRank, and, with weight $1 - \varepsilon$, the
popularity propagation factor of ongoing links. This factor is based on the importance of links
pointing to an object and is computed with a learning-based technique which automatically learns
the popularity propagation factor for different types of links using the partial ranking of the objects
given by domain experts. This ranking scheme is very different from ours, since PopRank uses an
external human contribution and is therefore problem dependent and impossible to replicate on a
different dataset.

A different approach for ranking multigraphs is the one which makes use of multilinear algebra
and tensors for representing graphs with multiple linkages. The tensor, however, does not
contain the same information we use in this paper. For example, if we are dealing with bibliographic
data, our models use the full author list for each paper, while the tensor only records the number of
common authors between each pair of papers. Hence it does not allow one to obtain a score for all the
features, such as authors or journals, and so it is not possible to compare its results with those
provided by our algorithm.

Sun et al., in the context of a bi-typed network (for example a bipartite bibliographic
graph with only authors and conference venues) or star-typed networks (for example a bibliographic
graph where we have papers and all the other features, such as authors, conference venues and terms,
are linked via papers), propose a ranking scheme combined with clustering, where the clustering
algorithm improves the ranking and vice versa. One of the ranking functions proposed is similar to
ours but applies only to the simpler graphs described above, with only two types of nodes. Still
in the context of bibliographic data, a model similar to one of our models,
namely the Simple Heap model (5), has also been proposed. It mainly differs from ours in the weighting strategy and
the use of a non-static model. However, we consider an enriched structure with a complete set of
relations between features. For example, in the context of bibliographic data we enrich the graph
by adding weighted links between authors, journals, and subject classification.

In previous papers by the same authors, a model was introduced in the context of bibliographic
data which is similar to one of the models of this paper (the one we call the Stiff model).
In particular, an integrated model for ranking scientific publications together with authors and
journals was presented. In that context, particular weighting strategies were implemented [5] and
an exponential decay factor was introduced to take into account the aging of citations, i.e. the fact
that if an old paper is not cited recently its importance should fade over time. In this paper we
further generalize the original ideas by introducing several models and other classes, making the approach
suitable also for ranking other multi-parameter data (patents, healthcare, social data, etc.). The
new models are better suited, for example, to handling updates of the datasets, which can be done at
a lower cost than in the Stiff model. In addition, in this paper the weighting strategies are problem
independent, while in the previous papers they were designed ad hoc for dealing with bibliographic
items.

Another contribution of this paper is the investigation of adequate numerical techniques to
compute the ranking score. In particular, we show how the computation of the ranks
relies upon the solution of a structured linear system, and we discuss and compare the
different algorithms which can be used to solve that system. Dealing with big data indeed requires
particular care in the choice of the numerical methods used in the algorithms, which should be stable
and fast. The final algorithm (Procedure SystemSolver) has been chosen on the basis
of several tests aimed at validating its convergence and stability properties. A similar analysis
has not been done in the literature, and often even methods requiring matrix manipulations or
spectral algorithms [29] fail to analyze this important aspect.

Another contribution of the paper is a first analysis of the robustness of the algorithms in the
presence of missing data. Many real-world datasets have missing entries, and many techniques have been
developed to deal with incomplete data and to make such datasets usable. A common
practice, and the easiest to apply, is to use only the items with complete information, discarding
those with incomplete data [20]. This is a rather drastic approach, especially when a large portion
of the data is incomplete. As an alternative, researchers have proposed to fill in a plausible value for
the missing observations. Among statisticians, distributional models for the data, such as maximum
likelihood and single or multiple imputation [22] [23], have been developed to replace
non-ignorable missing data. The goal of this paper, however, is not to study the preprocessing of data
for recovering missing features. This topic would require adequate models and techniques to
recover data and fill in the missing entries. In this paper we are only interested in quantifying how
the ranking score is affected when some of the data are missing (completely) at random¹. To this end
we assume that a missing entry corresponds to a zero value in the linkage structure, as is done
in large bibliographic databases such as Scopus and dblp, or even on the web, where no link is
associated with a broken link. We are aware that replacing a missing value with a zero is not a good choice
when the data do not have homogeneous attributes, but in the case of bibliographic data, patent
data or other networks fitting into the model of Figure 1 the set of the features is homogeneous.
For instance, any paper has at least an author, a publication venue, etc. Adding and removing links
at random is a common practice when evaluating the performance of ranking algorithms on large social
networks, to measure the tolerance of the ranking to spurious and missing links [13] [30] [31]. Our
experiments show that also for our algorithm the ranking obtained with incomplete data correlates
highly with the ranking obtained with the full dataset. Of course, our analysis does not rule out
that in certain contexts an appropriate preprocessing for recovering missing data can improve the
ranking provided by our algorithm.


2 Models 

In Section 2.1 we present a link-based ranking for a simple citation graph. In Section 2.2 we enrich
the graph with additional information (features) on the nodes.

2.1 The One-class model 

In this model we have a citation matrix $C = (c_{ij})$, where $c_{ij} = 1$ if node $i$ links to node $j$. There are many
examples of such matrices, for instance the web graph or the graph representing citations between
scholarly papers.

¹ The data are missing completely at random (MCAR) when the probability that a datum is missing does not depend
on any other data in the model [2]. Alternative assumptions have been studied in the literature [18] [22] [2], such as the
Missing at Random (MAR) or the Not Missing at Random (NMAR) cases.
Following an idea similar to Google’s PageRank [5], we assume that the importance $p_j$ of node $j$ is
given by the importance of the nodes $i$ citing $j$, scaled by $d_i$, the outdegree of $i$. The importance given
by $i$ is thus uniformly distributed among all the cited nodes, and the principle that the importance
of a subject is neither destroyed nor created is respected.

Here and below, we denote by $e$ the vector of appropriate length with all components equal
to one. We denote by $e_k$ the $k$-th column of the identity matrix of appropriate size. The size
of vectors and matrices, if not specified, is deduced from the context. Given a vector $v = (v_i)$ of
$n$ components, with the expression $\mathrm{diag}(v)$ we denote the $n \times n$ diagonal matrix having diagonal
entries $v_i$, $i = 1, \ldots, n$.

Since nodes may have an empty set of links, the matrix $C$ can have some null rows, and in that
case the corresponding outdegrees $d_i$ are zero. To avoid divisions by zero we introduce a dummy
node, numbered $n + 1$, which cites and is cited by all the existing nodes except itself. The new
adjacency matrix of size $n + 1$, denoted by $\widehat{C}$, has no null rows and is irreducible. The dummy node
collects the importance of all the nodes and redistributes it uniformly to all its neighbors.

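The outdegrees $d_i = \sum_j \widehat{c}_{ij}$ define the vector $d = (d_i)$, which satisfies the equation $d = \widehat{C} e$.

Moreover, since $d_i \neq 0$ for all $i$, the matrix

$$P = (p_{ij}) = \mathrm{diag}(d)^{-1} \widehat{C}$$

is row-stochastic, that is, $0 \le p_{ij} \le 1$, $\sum_j p_{ij} = 1$.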

A similar approach is used in the PageRank model, where $C$ is first normalized by row, and
then a random jump probability $\alpha$ is introduced to make the matrix irreducible. In our model the
probability of reaching the dummy node is not the same for all nodes, but varies according to the
outdegree of each node.

The ranking or “importance” of each node is computed by solving the following equation

$$x^T = x^T P, \qquad P = \mathrm{diag}(\widehat{C} e)^{-1} \widehat{C}. \tag{1}$$

Since the matrix $\mathrm{diag}(\widehat{C} e)^{-1} \widehat{C}$ is nonnegative and irreducible, from the Perron-Frobenius
theorem [28] there exists a unique vector $x = (x_i)$ such that $x_i > 0$, $\sum_i x_i = 1$, which solves (1).

We call $x$ the Perron vector of $P$.
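
As a small illustration of the One-class construction, the sketch below borders a toy citation matrix with the dummy node, row-normalizes it, and approximates the Perron vector. It uses plain power iteration only for simplicity, which is an assumption of this example and not the solver adopted in the paper; the function name and the toy matrix are also hypothetical. Power iteration converges here because the toy graph contains at least one citation, which makes the chain aperiodic.

```python
def one_class_rank(C, iters=200):
    """Approximate the Perron vector of P = diag(C_hat e)^{-1} C_hat.

    C is a 0/1 citation matrix given as a list of lists, C[i][j] = 1 if i cites j.
    The last component of the result is the score of the dummy node.
    """
    n = len(C)
    # Border C with a dummy node that cites, and is cited by, every real node.
    C_hat = [row[:] + [1] for row in C] + [[1] * n + [0]]
    # Row-normalize: every row sum is at least 1 thanks to the dummy node.
    P = [[c / sum(row) for c in row] for row in C_hat]
    # Power iteration x^T <- x^T P, starting from the uniform distribution.
    x = [1.0 / (n + 1)] * (n + 1)
    for _ in range(iters):
        x = [sum(x[i] * P[i][j] for i in range(n + 1)) for j in range(n + 1)]
    return x

# Toy example: paper 0 cites papers 1 and 2, paper 1 cites paper 2.
C = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
x = one_class_rank(C)
```

For this toy matrix the stationary equations can be solved by hand, giving scores proportional to 3, 4, 6 for the three papers and 9 for the dummy node, so the most cited paper ranks highest.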

This model, which we call One-class, was introduced in earlier work. It has been used to rank
scientific papers [5] and patents [24]. In [7], assuming the citation matrix is triangular, this model and
the PageRank model are viewed as special cases of a family of Markov chain-based models.

2.2 Multi-class models 

Often, besides the linkage structure, we have additional information that can be profitably used in
the ranking process. For example, to evaluate a paper we can use, besides the received citations,
other available information such as the authors or the journal where the paper has been published.
We now show that the mixing of all these ingredients (in this example authors, citations, journals) 
makes it possible to compute a better ranking for papers and, at the same time, a ranking score also 
for journals and authors. 

The idea is to compute a ranking value for authors based on the quality of their papers and of the
journals where the papers appeared. Journals can be evaluated as well, using the information about
the importance of the authors writing for that journal and of the papers published therein. This
approach was first proposed and later extended in earlier work. We start with the original citation
matrix $C$, then we add the information on the features of each item, storing it in rectangular
matrices. Examples of features are authors and journals if the items are scholarly papers, or firms,
inventors, technologies and lawyers if the items are patents. In general, we have $f$, $f = |\mathcal{A}| - 1$,
rectangular binary feature matrices $F_1, \ldots, F_f$ (one for each feature), where entry $(i,j)$ in $F_k$ is
different from zero iff item $i$ has attribute $j$ for feature $k$. For patent items, for example, we have
the “inventorship feature matrix” storing information about the inventors of a patent, that is, entry
$(i,j)$ is nonzero if $j$ is an inventor of patent $i$.
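
To make the definition of a feature matrix concrete, the following sketch builds a binary inventorship matrix from a list of (patent, inventor) pairs. The helper name, the dense list-of-lists layout and the toy data are assumptions made for illustration; for real datasets a sparse representation would be the natural choice.

```python
def feature_matrix(items, pairs):
    """Return (attributes, F) with F[i][j] = 1 iff items[i] has attribute attributes[j]."""
    attributes = sorted({a for _, a in pairs})
    col = {a: j for j, a in enumerate(attributes)}
    row = {it: i for i, it in enumerate(items)}
    F = [[0] * len(attributes) for _ in items]
    for it, a in pairs:
        F[row[it]][col[a]] = 1
    return attributes, F

# Hypothetical toy data: three patents and their inventors.
patents = ["p1", "p2", "p3"]
inventorship = [("p1", "ann"), ("p1", "bob"), ("p2", "bob"), ("p3", "ann")]
inventors, F = feature_matrix(patents, inventorship)
```

Each row of `F` corresponds to a patent and each column to an inventor, so `F` has size $n_C \times n_k$ with $n_C = 3$ and $n_k = 2$ in this example.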

Given the $n_C \times n_C$ citation matrix $C$, the feature matrices $F_k$, for $k = 1, 2, \ldots, f$, where each $F_k$
has size $n_C \times n_k$, and some weights $\alpha_{i,j}$, we can construct a block matrix $A$ of size $N = n_C + \sum_{k=1}^{f} n_k$
in different ways, leading to different models. Note that the size of $A$ is equal to the number of items
plus the number of attributes for each feature.

Once we have the block matrix $A$, we proceed as in the PageRank algorithm and we obtain ranks
for both items and attributes. To compute the ranking score as in (1), we first force irreducibility in
the underlying Markov chain and then normalize the resulting matrix to get a stochastic matrix $P$.

We now show that by varying the structure of the blocks combining the features and the strategy 
for forcing irreducibility we get four different base models. Combining these base models with 
different weighting strategies we obtain a total of 15 models summarized in Table 

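Stiff model Each matrix $F_k$, as well as the matrix $C$, is embedded in a matrix with an additional
row and column as follows

$$\widehat{F}_k = \begin{bmatrix} F_k & e \\ e^T & 0 \end{bmatrix}, \qquad \widehat{C} = \begin{bmatrix} C & e \\ e^T & 0 \end{bmatrix}.$$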

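The matrix $A$ is

$$A = \begin{bmatrix}
\widehat{F}_1^T \widehat{C} \widehat{F}_1 & \widehat{F}_1^T \widehat{F}_2 & \cdots & \widehat{F}_1^T \widehat{F}_f & \widehat{F}_1^T \\
\widehat{F}_2^T \widehat{F}_1 & \widehat{F}_2^T \widehat{C} \widehat{F}_2 & \cdots & \widehat{F}_2^T \widehat{F}_f & \widehat{F}_2^T \\
\vdots & & \ddots & & \vdots \\
\widehat{F}_f^T \widehat{F}_1 & \widehat{F}_f^T \widehat{F}_2 & \cdots & \widehat{F}_f^T \widehat{C} \widehat{F}_f & \widehat{F}_f^T \\
\widehat{F}_1 & \widehat{F}_2 & \cdots & \widehat{F}_f & \widehat{C}
\end{bmatrix}. \tag{2}$$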


The matrix $A$ is the adjacency matrix of a multigraph more complex than the one
described by the schema in Figure 1. In fact, all the possible relations between any pair of
features are accounted for, meaning that the graph is complete and we have $(f + 1)^2$ types
of edges. The diagonal blocks are of the form $\widehat{F}_k^T \widehat{C} \widehat{F}_k$ and contain the co-citations between
features. For example, if $F_k$ is the authorship matrix, each entry of $F_k^T C F_k$ accounts for
the number of citations between two authors. For off-diagonal blocks of type $F_k^T F_h$, for
example when $F_k$ is the paper-journal matrix² and $F_h$ is the paper-author matrix, each entry
accounts for how many papers an author has published in a given journal.

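For the construction of the stochastic and irreducible matrix $P$ we proceed as follows. We
normalize by row each block of matrix $A$, obtaining the stochastic and irreducible matrices
$P_{i,j}$, for $i,j = 1, 2, \ldots, f, f+1$, where $P_{f+1,f+1}$ corresponds to the row normalization of $\widehat{C}$.
Then, given a row-stochastic matrix of weights $\Gamma = (\gamma_{i,j})$ with $i,j = 1, 2, \ldots, f, f+1$, we build
matrix $P$ as follows

$$P = \begin{bmatrix}
\gamma_{1,1} P_{1,1} & \gamma_{1,2} P_{1,2} & \cdots & \gamma_{1,f+1} P_{1,f+1} \\
\gamma_{2,1} P_{2,1} & \gamma_{2,2} P_{2,2} & \cdots & \gamma_{2,f+1} P_{2,f+1} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{f,1} P_{f,1} & \cdots & \gamma_{f,f} P_{f,f} & \gamma_{f,f+1} P_{f,f+1} \\
\gamma_{f+1,1} P_{f+1,1} & \cdots & \gamma_{f+1,f} P_{f+1,f} & \gamma_{f+1,f+1} P_{f+1,f+1}
\end{bmatrix}. \tag{3}$$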


We call this model Stiff because it lacks flexibility. In fact, if we add an attribute to a
feature, we need not only to recompute the corresponding $F_k$ and the matrices involving $F_k$
in (2), but also to renormalize each of the changed blocks. This approach was followed in previous work
for ranking papers, authors and journals³. In [12] possible choices of
the weights $\gamma_{i,j}$ are discussed. Note that since the matrix of the weights $\Gamma$ is stochastic and
the blocks $P_{i,j}$ are also stochastic, matrix $P$ describes a coupled Markov chain.
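
The remark that row-stochastic weights combined with row-stochastic blocks give a row-stochastic matrix can be checked on a small example. In the sketch below the helper `couple`, the block values and the weights are hypothetical; only the structure of the assembled matrix follows (3). Row $r$ of block row $i$ sums to $\sum_j \gamma_{i,j} \cdot 1 = 1$.

```python
def couple(gamma, blocks):
    """Assemble P = (gamma[i][j] * blocks[i][j]) as one dense list-of-lists matrix.

    blocks[i][j] is a row-stochastic matrix of size n_i x n_j.
    """
    P = []
    for i, block_row in enumerate(blocks):
        n_rows = len(block_row[0])
        for r in range(n_rows):
            row = []
            for j, B in enumerate(block_row):
                row.extend(gamma[i][j] * b for b in B[r])
            P.append(row)
    return P

# Two classes of sizes 2 and 3; every block below has rows summing to one.
P11 = [[0.5, 0.5], [1.0, 0.0]]
P12 = [[0.2, 0.3, 0.5], [0.0, 1.0, 0.0]]
P21 = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
P22 = [[0.0, 0.5, 0.5], [0.25, 0.25, 0.5], [1.0, 0.0, 0.0]]
gamma = [[0.3, 0.7], [0.5, 0.5]]  # row-stochastic weights

P = couple(gamma, [[P11, P12], [P21, P22]])
row_sums = [sum(row) for row in P]
```

Every entry of `row_sums` equals one up to rounding, so `P` is again the transition matrix of a Markov chain.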

² In the paper-journal matrix an entry $(i,j)$ is nonzero if paper $i$ was published in journal $j$.

³ In [12] [6] each block was normalized in a particular way because row normalization was not always well suited for
that particular problem.

Static model This model differs from the previous one because, instead of adding a row and a column
to each of the feature matrices, we add a dummy item to the whole matrix, and then weight
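each block with suitable parameters $\alpha_{i,j}$. We obtain the matrix

$$A = \begin{bmatrix}
\alpha_{1,1} F_1^T C F_1 & \alpha_{1,2} F_1^T F_2 & \cdots & \alpha_{1,f} F_1^T F_f & \alpha_{1,f+1} F_1^T & e \\
\alpha_{2,1} F_2^T F_1 & \alpha_{2,2} F_2^T C F_2 & \cdots & \alpha_{2,f} F_2^T F_f & \alpha_{2,f+1} F_2^T & e \\
\vdots & & \ddots & & \vdots & \vdots \\
\alpha_{f,1} F_f^T F_1 & \alpha_{f,2} F_f^T F_2 & \cdots & \alpha_{f,f} F_f^T C F_f & \alpha_{f,f+1} F_f^T & e \\
\alpha_{f+1,1} F_1 & \alpha_{f+1,2} F_2 & \cdots & \alpha_{f+1,f} F_f & \alpha_{f+1,f+1} C & e \\
e^T & e^T & \cdots & e^T & e^T & 0
\end{bmatrix}.$$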


We then normalize by row to get the stochastic irreducible matrix $P = \mathrm{diag}(Ae)^{-1} A$. For
this and the remaining models proposed in this section, whenever we add a new attribute to
an existing feature we have to change only the matrix of the feature involved. Indeed, we do
not need to build the matrix $A$ explicitly, since all the computation can be done using only the
matrices $F_k$ and $C$.
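
The block structure of the Static model can be illustrated with a small dense assembly. This is a sketch under stated assumptions: the helper names, the dense list-of-lists layout and the toy data are hypothetical, and building $A$ explicitly is done here only for readability, since the text notes that the computation can work directly with $F_k$ and $C$.

```python
def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def scale(alpha, X):
    return [[alpha * v for v in row] for row in X]

def static_matrix(C, Fs, alpha):
    """Assemble the Static-model matrix A, bordered by the dummy row and column."""
    f = len(Fs)
    block_rows = []
    for k in range(f):
        Fk_t = transpose(Fs[k])
        blocks = []
        for h in range(f):
            # Diagonal blocks use citations between items; off-diagonal ones do not.
            B = matmul(matmul(Fk_t, C), Fs[k]) if h == k else matmul(Fk_t, Fs[h])
            blocks.append(scale(alpha[k][h], B))
        blocks.append(scale(alpha[k][f], Fk_t))
        block_rows.append(blocks)
    block_rows.append([scale(alpha[f][h], Fs[h]) for h in range(f)]
                      + [scale(alpha[f][f], C)])
    A = []
    for blocks in block_rows:
        for r in range(len(blocks[0])):
            A.append([v for B in blocks for v in B[r]] + [1])  # link to the dummy item
    A.append([1] * len(A) + [0])  # the dummy item links to every other node
    return A

def row_normalize(A):
    return [[v / sum(row) for v in row] for row in A]

# Toy data: 3 papers, 2 authors (F1), 2 journals (F2), uniform weights.
C = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
F1 = [[1, 1], [0, 1], [1, 0]]
F2 = [[1, 0], [1, 0], [0, 1]]
alpha = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
A = static_matrix(C, [F1, F2], alpha)
P = row_normalize(A)
```

With these toy sizes $A$ is $8 \times 8$ (2 authors, 2 journals, 3 papers and the dummy item), and no row is null because every node links to the dummy item.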

The next two models are designed for dealing with problems where the feature data is incomplete,
for example a bibliographic database where we only know the first author of each paper. In this
case we cannot expect to compute an accurate rank for authors, but we would still like to use the
available author information to better rank papers. The structure of the blocks is now homogeneous
among off-diagonal and diagonal blocks, so that we can ideally consider all the features heaped in just
one matrix $F$ containing all the information on the different attributes. Matrix $F$ has size $n_C \times n_t$,
$n_t$ being the total number of attributes available, for example the sum of the numbers of distinct authors,
journals, etc. Since the Heap model can also be used with complete data, we describe the model keeping
the features distinct, knowing that the features can be squeezed into a single matrix when the feature
classes are scarcely populated.

Heap model The Heap model differs from the Static model in the off-diagonal blocks. Blocks $F_k^T F_h$
are replaced by $F_k^T C F_h$. In the previous example, where $F_k$ was the paper-journal matrix and
$F_h$ is the paper-author matrix, the entry $(i,j)$ of $F_k^T F_h$ is the number of papers author $j$ has
published in journal $i$, while the $(i,j)$ entry of $F_k^T C F_h$ is the number of citations from papers
written by author $j$ to all papers published in journal $i$.

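Assigning to each block a weight $\alpha_{i,j}$, we get the matrix $A$

$$A = \begin{bmatrix}
\alpha_{1,1} F_1^T C F_1 & \alpha_{1,2} F_1^T C F_2 & \cdots & \alpha_{1,f} F_1^T C F_f & \alpha_{1,f+1} F_1^T & e \\
\alpha_{2,1} F_2^T C F_1 & \alpha_{2,2} F_2^T C F_2 & \cdots & \alpha_{2,f} F_2^T C F_f & \alpha_{2,f+1} F_2^T & e \\
\vdots & & \ddots & & \vdots & \vdots \\
\alpha_{f,1} F_f^T C F_1 & \alpha_{f,2} F_f^T C F_2 & \cdots & \alpha_{f,f} F_f^T C F_f & \alpha_{f,f+1} F_f^T & e \\
\alpha_{f+1,1} F_1 & \alpha_{f+1,2} F_2 & \cdots & \alpha_{f+1,f} F_f & \alpha_{f+1,f+1} C & e \\
e^T & e^T & \cdots & e^T & e^T & 0
\end{bmatrix}.$$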


To get the stochastic matrix P we just normalize A by row. 
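
The difference between the off-diagonal blocks of the Static and Heap models can be seen on a toy example. The matrices below are hypothetical data chosen for illustration, and the helper functions are not part of the paper; the two products are exactly $F_k^T F_h$ and $F_k^T C F_h$ with $F_k$ the paper-journal matrix and $F_h$ the paper-author matrix.

```python
def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Toy data: 3 papers; C[i][j] = 1 if paper i links to paper j.
C = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
F_journal = [[1, 0], [1, 0], [0, 1]]  # papers 0 and 1 in journal 0, paper 2 in journal 1
F_author = [[1, 1], [0, 1], [1, 0]]   # paper 0 by authors 0 and 1, and so on

# Static model: co-occurrence counts (papers shared by journal i and author j).
co_occurrence = matmul(transpose(F_journal), F_author)

# Heap model: the same pair of features, but weighted through the citation links.
cross_citation = matmul(matmul(transpose(F_journal), C), F_author)
```

In this example `co_occurrence` is `[[1, 2], [1, 0]]`, while `cross_citation` is `[[2, 1], [0, 0]]`: journal 1 has a paper by author 0, but no citation link connects it to that author's papers through $C$, so the corresponding Heap entry is zero.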

Simple Heap model In this model we assume that there is no interaction between features, so that
cross-citations do not influence the rank.

$$A = \begin{bmatrix}
0 & \alpha_{1,2} F^T & e \\
\alpha_{2,1} F & \alpha_{2,2} C & e \\
e^T & e^T & 0
\end{bmatrix}, \tag{5}$$

where $F$ is a matrix containing all the relations between items and attributes, i.e. $F =
[F_1, F_2, \ldots, F_f]$. As already observed, this model uses a simplified setting to deal with the case
where we have incomplete data.



It is desirable that, as $\alpha_{f+1,f+1} \to 1$ and $\alpha_{i,j} \to 0$ for all the other values of $i,j$, the rank obtained with these
models converges to the rank obtained with the one-class model. This is in fact guaranteed because
in all the models, for the limit value of $\alpha_{i,j}$, $A$ collapses to a matrix of the form

$$\begin{bmatrix} O & O \\ O & C \end{bmatrix}$$

and the rank of the items is the same (up to a scaling factor) as the one obtained with the one-class
model, while all the features will get a uniform score.
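
This limit behavior can be checked numerically on a toy case. The sketch below makes two assumptions that go beyond the text: the collapsed matrix keeps the dummy row and column of the Static model, and the Perron vector is approximated with plain power iteration rather than with the solver used in the paper. Function and variable names are hypothetical.

```python
def stationary(A, iters=2000):
    """Row-normalize A and approximate its left Perron vector by power iteration."""
    P = [[v / sum(row) for v in row] for row in A]
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

C = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
n_c, n_t = 3, 2  # three items, two attributes in total

# One-class model: C bordered by the dummy node.
one_class = [row + [1] for row in C] + [[1] * n_c + [0]]

# Limit of the multi-class models: zero feature blocks, C, and the dummy border.
collapsed = ([[0] * (n_t + n_c) + [1] for _ in range(n_t)]
             + [[0] * n_t + row + [1] for row in C]
             + [[1] * (n_t + n_c) + [0]])

y = stationary(one_class)
x = stationary(collapsed)

# Ratio between the item scores of the two models; it should be constant.
ratios = [x[n_t + i] / y[i] for i in range(n_c)]
feature_scores = x[:n_t]
```

In this example all entries of `ratios` coincide, so the item ranking agrees with the one-class ranking up to scaling, and the two attributes receive the same score.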

2.3 Weighting strategies 

Weighting strategies play an important role in the tuning of the algorithm, since by varying them we
can change the relative importance of features vs citations and consequently change the final ranking.
We propose five different weighting strategies for our models, but not all strategies can be applied
to each model, and for different models two weighting schemes may coincide after normalization of
the matrix $A$.

The simplest strategy is the Uniform (U) one, which corresponds to choosing $\alpha_{i,j} = 1$ for each
$i, j = 1, \ldots, f + 1$. By adopting this weighting scheme the contribution of each class (feature or
citation) is valued in the same way, independently of its size. This approach appears adequate only
when the sizes of the classes are of the same order of magnitude; otherwise we are giving a bigger
role in the determination of the ranking to scarcely populated classes.

For this reason, we also consider schemes that keep track of the size of each class. We have 
different choices. 

Dimension-based (D) We set $\alpha_{i,j} = n_j / n_c$, and $\alpha_{i,f+1} = 1$. In this way we guarantee that the
average value of the features is the same [12], and we do not favor more populated
classes with respect to less populated ones. The weights are the same for each block of columns.

Double-Dimension-based (DD) We have a symmetric weight matrix, setting $\alpha_{i,j} = \sigma_i \sigma_j$, where
$\sigma_i = n_i / n_c$ is the normalized size of the $i$-th feature. When the citation matrix is much
larger than the matrices $F_i$, this scheme gives more importance to citations than to features.

Heap (H) We set $\alpha = (\sum_{k=1}^{f} n_k) / n_c$ for the first $f$ blocks of columns, that is $\alpha_{i,j} = \alpha$ for $i =
1, \ldots, f + 1$ and $j = 1, \ldots, f$; for the blocks in the last column we set $\alpha_{j,f+1} = 1$, for $j =
1, \ldots, f + 1$. This weighting strategy is particularly suited for the Heap or Simple Heap model.

Double-Heap (HH) In this case the weights are not the same along the blocks of columns but,
defining $\alpha = (\sum_{k=1}^{f} n_k) / n_c$, we have $\alpha_{i,j} = \alpha^2$ for $i, j = 1, \ldots, f$, and the weights of the blocks
in the last column are $\alpha_{j,f+1} = \alpha$, and in the last row $\alpha_{f+1,j} = \alpha$. Moreover $\alpha_{f+1,f+1} = 1$.

This scheme is also particularly suited for the Heap or Simple Heap model, since they have the
same value in the upper left blocks. Assuming $\alpha < 1$ we are again giving more importance to
citations when determining the ranking scores of the other nodes.
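
As a rough illustration, the weight tables for three of the schemes can be generated from the class sizes alone. The sketch below assumes the readings $\alpha_{i,j} = 1$ for U, $\alpha_{i,j} = n_j/n_c$ with last column 1 for D, and $\alpha = (\sum_k n_k)/n_c$ with last column 1 for H; the function names are hypothetical, and DD and HH are left out because their formulas are only partly legible. The sizes used are those reported for dataset DS1.

```python
def uniform_weights(sizes, n_c):
    """U scheme: every block gets weight 1 (n_c is unused, kept for a common signature)."""
    f = len(sizes)
    return [[1.0] * (f + 1) for _ in range(f + 1)]


def dimension_weights(sizes, n_c):
    """D scheme (as read here): column block j gets n_j / n_c, last column gets 1."""
    f = len(sizes)
    row = [n_j / n_c for n_j in sizes] + [1.0]
    return [list(row) for _ in range(f + 1)]


def heap_weights(sizes, n_c):
    """H scheme: first f column blocks get alpha = sum(n_k) / n_c, last column gets 1."""
    f = len(sizes)
    alpha = sum(sizes) / n_c
    row = [alpha] * f + [1.0]
    return [list(row) for _ in range(f + 1)]


# Class sizes of DS1: Technologies, Firms, Inventors, Lawyers, Examiners.
sizes = [472, 165662, 965878, 25341, 12817]
n_c = 2474786  # number of patents in DS1

W_u = uniform_weights(sizes, n_c)
W_d = dimension_weights(sizes, n_c)
W_h = heap_weights(sizes, n_c)
```

Each function returns an `(f + 1) x (f + 1)` table whose entry `[i][j]` multiplies block `(i, j)` of the model matrix.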

While it is always possible to apply a Uniform weight to each base model, it does not make sense
to apply some of the weighting strategies to the Stiff or Static model. In fact the H or the HH
weighting techniques make sense only when the structure of diagonal and off-diagonal blocks is the
same, as in the case of the Heap or Simple-Heap model. Using the H or the HH weighting techniques
in combination with the Heap model we can rewrite the matrix A in a more compact form, collecting
all the features in a unique matrix F. We get


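$$
\hat A = \begin{bmatrix}
\alpha_{1,1} F^T C F & \alpha_{1,2} F^T & e \\
\alpha_{2,1} F & \alpha_{2,2} C & e \\
e^T & e^T & 0
\end{bmatrix}.
$$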


In Table 1 we summarize the fifteen full models obtained combining the four basic models with the
five weighting schemes.

models\weights   U          D          DD          H          HH
Stiff            Stiff-U    Stiff-D    -           -          -
Static           Static-U   Static-D   Static-DD   -          -
Heap             Heap-U     Heap-D     Heap-DD     Heap-H     Heap-HH
Simple-Heap      SHeap-U    SHeap-D    SHeap-DD    SHeap-H    SHeap-HH

Table 1: The 15 models obtained combining the basic models with the different weighting strategies.


3 Computation of the Perron vector 

In all our models, to compute the rank we have to solve an eigenvector problem involving a stochastic
irreducible matrix. More precisely, we have to find the left Perron vector $\hat x$ such that $\hat x^T = \hat x^T P$,
with $P$ stochastic. We now show that the Perron vector can be computed as the solution of a linear
system involving a matrix $M$. Let

$$
\hat M = \begin{cases}
\hat C & \text{if } f = 0, \\
\hat A & \text{if } f \ge 1.
\end{cases}
$$

Separating the last row and column of $\hat M$ we have

$$
\hat M = \begin{bmatrix} M & u \\ v^T & 0 \end{bmatrix},
$$

where $M$ has size $N \times N$ and $u$, $v$ are suitable $N$-vectors (for the Stiff models $u$, $v$ are the last
column and row of $P$, while for all other models $u = v = e$). The matrix $P$ is obtained
normalizing $\hat M$ by row, that is $P = \mathrm{diag}(\hat M e)^{-1} \hat M$. Let

$$
\hat D = \mathrm{diag}(\hat M e)^{-1} = \begin{bmatrix} D(u) & \\ & 1/(v^T e) \end{bmatrix},
$$

where $D(u) = \mathrm{diag}(M e + u)^{-1}$. Setting $\hat x^T = (x^T, x_{N+1})$, where $x$ is an $N$-vector, the equation
$\hat x^T = \hat x^T P$ can be rewritten as

$$
\begin{cases}
x^T = x^T D(u) M + \dfrac{x_{N+1}}{v^T e}\, v^T, \\
x_{N+1} = x^T D(u)\, u.
\end{cases}
\tag{6}
$$

Since we are interested in the direction of the Perron vector and not in its norm, we can choose
$x_{N+1} = v^T e$, obtaining $x^T = x^T D(u) M + v^T$. The vector $x$ is then the solution of the linear
system

$$
(I - M^T D(u))\, x = v,
\tag{7}
$$

or can be computed by the iterative method

$$
x^{(i+1)\,T} = x^{(i)\,T} D(u) M + v^T.
\tag{8}
$$


Note that for the Stiff models, where the bordered matrix is already stochastic, it is $D(u) = I$.
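
The relation between the iteration (8) and the Perron vector of $P$ can be verified numerically on a small case. The sketch below is a toy check, not the paper's implementation: the 3 x 3 matrix `M` is invented, `u = v = e` is taken as in the non-Stiff models, and the number of iterations is an arbitrary choice that is more than enough for this example.

```python
def perron_by_iteration(M, u, v, iters=200):
    """Run x^T <- x^T D(u) M + v^T, with D(u) = diag(M e + u)^{-1}."""
    n = len(M)
    d = [1.0 / (sum(M[i]) + u[i]) for i in range(n)]  # diagonal of D(u)
    x = list(v)
    for _ in range(iters):
        y = [x[i] * d[i] for i in range(n)]           # y^T = x^T D(u)
        x = [sum(y[i] * M[i][j] for i in range(n)) + v[j] for j in range(n)]
    return x


# Hypothetical nonnegative matrix; u = v = e as in the non-Stiff models.
M = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.0],
     [3.0, 1.0, 0.0]]
u = [1.0, 1.0, 1.0]
v = [1.0, 1.0, 1.0]
x = perron_by_iteration(M, u, v)

# Check: (x, v^T e) should be a left fixed point of the row-normalized bordered matrix.
n = len(M)
x_hat = x + [sum(v)]
M_hat = [M[i] + [u[i]] for i in range(n)] + [v + [0.0]]
P = [[a / sum(row) for a in row] for row in M_hat]
x_hat_P = [sum(x_hat[i] * P[i][j] for i in range(n + 1)) for j in range(n + 1)]
residual = max(abs(a - b) for a, b in zip(x_hat_P, x_hat))
```

The iteration converges here because every row of `D(u) M` sums to less than one; `residual` measures how far the bordered vector is from satisfying the eigenvector equation.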

It is important to observe that in the proposed models we can simply work with the matrices $F_j$
without explicitly normalizing and storing the complete matrix $M$. For example, for the Static model
the $i$-th block, $i = 1, \ldots, f$, of the vector $M e + u$, used for constructing the matrix $D(u)$, can
be computed as follows

$$
z_i = \sum_{j=1,\, j \ne i}^{f} \alpha_{i,j} F_i^T F_j e_j + \alpha_{i,i} F_i^T C F_i e_i + \alpha_{i,f+1} F_i^T e_{f+1} + e_i, \qquad i = 1, \ldots, f,
$$

and

$$
z_{f+1} = \sum_{j=1}^{f} \alpha_{f+1,j} F_j e_j + \alpha_{f+1,f+1} C e_{f+1} + e_{f+1}.
$$


The cost for computing the vectors $z_i$ is linear in the number of nonzeros (denoted as nnz) of all the
matrices $F_i$ and $C$, that is $O(\sum_i \mathrm{nnz}(F_i) + \mathrm{nnz}(C))$, since the matrices $F_i$ and $C$ are stored in a sparse
format and the cost of multiplying a sparse matrix by a vector is proportional to the number of nonzeros in the matrix.
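
The linear cost can be seen directly in a coordinate-format product, where each stored nonzero is touched exactly once. This is a generic sketch with a hypothetical function name and invented data; the storage format actually used by the authors is not specified.

```python
def coo_matvec(n_rows, entries, x):
    """Compute y = A x for A given as (row, col, value) triples.

    One multiply-add is performed per stored nonzero, so the cost is O(nnz).
    """
    y = [0.0] * n_rows
    for i, j, a in entries:
        y[i] += a * x[j]
    return y


# Hypothetical 4 x 2 feature matrix: each paper belongs to exactly one class.
F = [(0, 0, 1.0), (1, 0, 1.0), (2, 1, 1.0), (3, 1, 1.0)]
row_sums = coo_matvec(4, F, [1.0, 1.0])  # F e: one pass over the 4 nonzeros
```

Multiplying by the vector of ones, as done here, gives the row sums needed to build the normalization.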

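Letting $w_i$ denote the vectors of length $n_i$ whose entries are the reciprocals of the entries of the
vectors $z_i$, and noticing that the $i$-th diagonal block of the matrix $D(u)$ contains the entries of $w_i$, an
iteration of (8), from step $k$ to step $k+1$, becomes

$$
x_j^{(k+1)\,T} = \begin{cases}
\displaystyle \sum_{i=1,\, i \ne j}^{f} \alpha_{i,j} (x_i^{(k)} * w_i)^T F_i^T F_j + \alpha_{j,j} (x_j^{(k)} * w_j)^T F_j^T C F_j + \alpha_{f+1,j} (x_{f+1}^{(k)} * w_{f+1})^T F_j + e_j^T, & j \le f, \\[2ex]
\displaystyle \sum_{i=1}^{f} \alpha_{i,f+1} (x_i^{(k)} * w_i)^T F_i^T + \alpha_{f+1,f+1} (x_{f+1}^{(k)} * w_{f+1})^T C + e_{f+1}^T, & j = f + 1,
\end{cases}
$$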


where $*$ denotes the component-wise (Hadamard) product between vectors. The component-wise
products can be computed in $\sum_i n_i + n_c$ multiplications, and the total cost of computing the
new vector is proportional to the number of nonzeros in the matrix $[F_1, F_2, \ldots, F_f, C]$ plus
$\sum_i n_i + n_c$. We can proceed analogously for the other models.


4 Solution of the Linear System with non-stationary methods

Once the problem of the computation of the Perron vector is reformulated as the solution of the
linear system (7), we can employ the iterative method described in (8), stationary methods such
as Jacobi or Gauss-Seidel iterations, or the more promising Krylov methods. In fact, non-stationary
methods also need only the computation of matrix-vector products and are in general more
effective than stationary ones (see [11, 14] for a comparison between stationary and non-stationary
methods on similar problems). Recall that, to compute the product $Mx$, we do not explicitly
form and store the matrices $M$ and $D(u)$ but we store in sparse form only the matrices of the
features $F_i$ and the citation matrix $C$.

We implemented different Krylov methods, and among them we chose the three best performing:
BCGStab, CGS, TFQMR (see [17] for the details on these methods).

To refine the final result we add a few steps of the iteration (8) in accordance with the Iterative
Refinement algorithm described below. In particular we perform some additional iterative steps until
either the distance between two successive iterates is less than tol or we are stuck and the vector is not
changing anymore.


Procedure Iterative Refinement
Input: tol

while ||x^(i+1) - x^(i)|| >= tol and | ||x^(i) - x^(i-1)|| - ||x^(i+1) - x^(i)|| | >= tol

    do a step of the iterative method (8), i = i + 1

endwhile
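
The stopping rule of the procedure can be sketched as follows. This is an illustrative reading only: the choice of the 1-norm, the `max_steps` safety cap and the function name are assumptions, and `step` stands for one application of the iterative method.

```python
def iterative_refinement(step, x, tol, max_steps=10_000):
    """Repeat x <- step(x) until the update is below tol or it stops shrinking."""

    def dist(a, b):
        # 1-norm of the difference (the norm is an assumption of this sketch).
        return sum(abs(p - q) for p, q in zip(a, b))

    prev = None
    for _ in range(max_steps):  # safety cap, not part of the original procedure
        x_new = step(x)
        cur = dist(x_new, x)
        x = x_new
        if cur < tol:
            break  # successive iterates are close enough
        if prev is not None and abs(prev - cur) < tol:
            break  # stuck: the size of the update no longer changes
        prev = cur
    return x


# Toy contraction with fixed point 2, standing in for one step of the iteration.
result = iterative_refinement(lambda x: [0.5 * xi + 1.0 for xi in x], [0.0], 1e-12)
```

The second test catches the case in which the update size stagnates above `tol`, which would otherwise keep the loop running until the cap.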


4.1 Models Validation: Stability and Convergence 

To test the methods for the solution of (7) we constructed two datasets with real data extracted
from the US Patent Office, and we used five features: Firms, Inventors, Technologies, Lawyers and
Examiners. In particular, we denote by $F_1$ the patent-technology matrix where entry $(i, j)$ is one

models       BCGstab              CGS                  TFQMR
             it   log10(res)      it   log10(res)      it   log10(res)
Stiff-U      18   -10.49          100  -7.71           21   -3.90
Stiff-D      23   -11.77          100  -11.20          19   -4.75
Static-U     35   -9.03           100  -6.25           40   -7.83
Static-D     39   -11.13          100  -7.22           37   -9.60
Static-DD    35   -12.33          100  -12.20          30   -11.99
Heap-U       32   -9.86           100  -7.44           36   -8.59
Heap-D       36   -11.26          100  -7.73           38   -9.74
Heap-DD      41   -11.48          100  -9.46           33   -11.51
Heap-H       36   -10.83          100  -6.46           30   -7.72
Heap-HH      24   -9.85           100  -7.14           27   -8.38
SHeap-U      32   -11.56          100  -8.00           29   -9.91
SHeap-D      32   -11.72          100  -8.51           28   -9.85
SHeap-DD     37   -11.43          100  -11.83          28   -11.97
SHeap-H      28   -10.56          100  -8.33           25   -9.98
SHeap-HH     29   -11.34          100  -6.75           24   -10.05

Table 2: Performance comparison between three Krylov methods on the 15 models on a problem of
size 3.7 million.


if patent $i$ uses technology $j$; by $F_2$ the patent-firm matrix, recording the firm owning the patent;
by $F_3$ the patent-inventor matrix, which maps patents to inventors; by $F_4$ the patent-lawyer matrix, where
each patent is matched to the lawyers applying for the patent; and by $F_5$ the matrix where each
patent is associated with the examiners from the US Patent Office who approved the patent. The matrix
$C$ contains the citations between patents and is almost triangular since each patent can be based
only on patents from the past.

DS1: Consists of $n_c$ = 2 474 786 US patents from 1976-1990. Of these patents we have additional
information that can be grouped into 5 major features, namely $n_1$ = 472 Technologies, $n_2$ =
165 662 Firms, $n_3$ = 965 878 Inventors, $n_4$ = 25 341 Lawyers and $n_5$ = 12 817 Examiners,
giving rise to a matrix $A$ of size $n_c + \sum_{i=1}^{5} n_i$, which is approximately 3.7 million.

DS2: Consists of 7 984 635 US patents from 1976-2012. The sizes of the five features are as follows: 475
Technologies, 633 551 Firms, 4 088 585 Inventors, 120 668 Lawyers and 64 088 Examiners,
giving rise to a matrix $A$ of size approximately 13 million.

The feature matrices and the citation matrix C are used to obtain ranks both for patents and
features, i.e. Technologies, Firms, Inventors, Lawyers and Examiners, with the techniques described
in this section.

When using iterative solvers we always have to address the question of numerical stability. The
three proposed methods, BCGStab, CGS and TFQMR, have been tested on the two datasets with an
error goal of 10“^^ and with maximum number of iterations equal to 100. For the refinement steps
of the power method we set tol = 10“^^. Applying the three methods to all the
models on dataset DS1 we obtain the results summarized in Table 2, where instead of the actual residuals we report
only their base 10 logarithm.

It is evident that CGS is inadequate to cope with this kind of problem, since after 100 iterations
we still have a high residual norm. Moreover BCGstab is better than TFQMR since it almost
always achieves a lower residual norm. For these reasons we restrict our analysis to BCGStab and TFQMR,
comparing them on the dataset of size 13M. We obtain the results reported in Table 3.

We note that BCGstab is clearly better than TFQMR, but sometimes fails to reach an acceptable
accuracy. Hence a three-step algorithm, described in Procedure SystemSolver, has been devised.
models       BCGstab              TFQMR
             it   log10(res)      it   log10(res)
Stiff-U      14   -10.57          19   -3.83
Stiff-D      21   -11.30          25   -3.71
Static-U     37   -6.77           52   -3.09
Static-D     52   -10.53          53   -7.89
Static-DD    39   -11.38          43   -8.70
Heap-U       36   -8.87           47   -7.58
Heap-D       45   -6.47           51   -6.11
Heap-DD      41   -9.40           41   -6.56
Heap-H       40   -9.63           43   -7.31
Heap-HH      35   -9.49           43   -7.52
SHeap-U      40   -9.78           38   -7.48
SHeap-D      38   -10.36          36   -8.10
SHeap-DD     36   -11.75          34   -9.79
SHeap-H      31   -7.90           35   -4.54
SHeap-HH     35   -10.63          35   -5.91

Table 3: Performance comparison between two Krylov methods applied to the 15 models of Table 1
on a problem of size 13 million.


Procedure SystemSolver

Input: Initial guess, ErrorGoal, maxiter, tol

Apply BCGStab with error goal=ErrorGoal and maximum iterations=maxiter
if res > ErrorGoal

    Apply TFQMR with error goal=ErrorGoal and maximum iterations=maxiter

endif

Apply Iterative Refinement with tolerance tol
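
The control flow of the procedure can be written down independently of any particular solver library. In the sketch below the three stages are passed in as callables; their signatures, the function name, the decision to restart the fallback from the initial guess, and all numeric values are assumptions made for illustration, since the text does not specify them.

```python
def system_solver(primary, fallback, refine, x0, error_goal, maxiter, tol):
    """Three-step scheme: primary solver, optional fallback, then refinement.

    primary/fallback: callables (x0, error_goal, maxiter) -> (x, residual)
    refine:           callable (x, tol) -> x
    """
    x, res = primary(x0, error_goal, maxiter)
    used_fallback = False
    if res > error_goal:
        # Assumption of this sketch: the fallback restarts from the initial guess.
        x, res = fallback(x0, error_goal, maxiter)
        used_fallback = True
    return refine(x, tol), used_fallback


# Stand-in solvers, used only to exercise the control flow.
def stalled(x0, goal, maxiter):
    return x0, 1e-3  # pretends to stop with a residual above the goal


def accurate(x0, goal, maxiter):
    return [2.0 for _ in x0], goal / 10  # pretends to reach the goal


def no_refine(x, tol):
    return x


x, used = system_solver(stalled, accurate, no_refine, [0.0], 1e-10, 100, 1e-12)
```

In a real run the two stand-ins would be replaced by a BCGStab and a TFQMR routine, and `no_refine` by a few steps of the refinement iteration.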


Applying this procedure, with ErrorGoal=10“^°, maxiter=100 and tol=10“^^, on both
datasets we get the results displayed in Table 4.

From Table 4 we observe that the most stable models for the two datasets considered are
Stiff-D and Static-DD and, among the Heap-like models, Heap-DD and SHeap-DD also perform
well.


5 Numerical Experiments 

Validating a ranking model is a rather difficult task since no ground truth is known
in the general case. Moreover the validity of a model clearly depends on what we would like to
measure. For example, if we want to measure the aptitude of a scholar to work in a team we will
highly value the articles written in collaboration, while if we want to measure the scientific strength
and personal skills, we may want to normalize each of the articles by the number of co-authors. In
this respect the extreme variety of our models and the different weighting strategies allow us to tune
the parameters to better satisfy the different needs.

Table 5 summarizes the experiments we performed on the two patent datasets. In the first
set of experiments we compare the different ranking scores obtained with our models with simpler
ranking methods, namely the PageRank algorithm applied only to the citation matrix C, the ranking
provided by the one-class model and the simple citation count. The evaluation measure P@N is also
presented for comparing the top N ranked items by some of our models with simple citation count
models       DS1 (size = 3.7M)            DS2 (size = 13M)
             time(sec.)   log10(res)      time(sec.)   log10(res)
Stiff-U      237          -12.095         1078         -11.952
Stiff-D      179          -12.762         1422         -12.1884
Static-U     (*)239       -10.536         (*)2314      -9.0309
Static-D     188          -11.740         2096         -10.536
Static-DD    161          -13.002         1688         -11.7138
Heap-U       509          -11.138         (*)5992      -9.7579
Heap-D       467          -11.740         (*)7509      -10.536
Heap-DD      450          -12.535         (*)5849      -11.6019
Heap-H       440          -11.138         (*)5978      -9.93399
Heap-HH      (*)403       -11.439         (*)4999      -10.235
SHeap-U      80           -11.740         (*)717       -11.1381
SHeap-D      70           -11.439         661          -11.4391
SHeap-DD     67           -13.107         604          -12.3703
SHeap-H      86           -12.041         (*)662       -10.8371
SHeap-HH     60           -11.689         595          -10.536

Table 4: Performance of procedure SystemSolver on the 15 models on DS1 and DS2. The results
labeled with (*) are those where TFQMR has been applied since the required precision of 10“^^ on the
residual norm was not satisfied after 100 steps of BCGStab.


Purpose                              experiment     models                               Section
Comparison (pat. and firms)          One-class      All                                  5.1
                                     PageRank       All
                                     # Citations    All
                                     P@N            StiffD, StaticDD, HeapHH, SHeapHH
Incomplete data                      p = 0.1        All                                  5.2
                                     p = 0.5        All
                                     Top N vs p     StaticDD
Consistence for class aggregation    finer          All                                  5.3
                                     coarser        All

Table 5: Description of the experiments performed.


and PageRank. The top N firms obtained with some of our ranking methods are compared with the
rank induced by the number of patents issued to each firm.

A second set of tests aims at showing that our different models are adequate to deal with incomplete
data. In order to show that empirically, we remove increasing percentages of the attribute
links to show that, when dealing with an incomplete database, our methods are still robust in providing
a ranking “similar” to the one obtained with the full data. Of course, when the majority of the links
are removed the rank should converge to the rank obtained with the One-class model. A direct
comparison between the top ranked results with full and partial data is done as well.

With the third set of experiments we compare the ranking scores of the same algorithms with a
finer or coarser aggregation in subclasses.

5.1 Comparison between models 

The experiments reported in this section have different purposes. First we compare the rank provided
by each model with the rank obtained with the one-class model, with the standard PageRank model
and with the simple in-link counting. The idea is that the provided rank should differ substantially
from the ranking obtained by simply counting the number of citations received, but the presence of
the features should refine the ranking without completely reversing the importance of the players
obtained by the one-class model or by the PageRank model.

Figure 2 shows the rank provided by our one-class model versus the rank provided by
the standard PageRank algorithm [5]. A dot with coordinates $(x_i, y_i)$ represents the $i$-th patent, where
$x_i$ is the ranking score computed with the classical PageRank algorithm and $y_i$ the ranking score
computed using our One-class model. We see that the two ranks are very alike because most of the
points are located on a narrow strip along the main diagonal, reflecting the high correlation between
the two ranks. In fact the only difference between the two models is the probability of reaching the dummy
node, which is 0.15 in PageRank while it changes according to the outdegree of each node in
our model.


Figure 2: Comparison of the rank provided by the PageRank algorithm with random jump probability
equal to 0.15 and the one obtained by the one-class model applied to DS1.



[Figure 3 panels: STIFF U, STATIC U, HEAP U; each plots the ranking score (val) against the citation count (#ref).]

Figure 3: Comparison between the rank provided by three models with a Uniform weighting
strategy (val) and citation count (#ref).

Examining the plots of the ranks obtained with all the models in Table 1 versus the number of
citations received, it turns out that the Uniform weighting scheme is not very adequate. For
example, in Figure 3(a) we see that there are objects that rank very high and have very few citations,
while some of those with many citations receive a very low rank value. This effect is less noticeable
in the Static or Heap models, but the influence of the number of citations on the actual ranking
still seems to be too weak. These problems, together with the instability observed in the previous section
(see Tables 2, 3 and 4, noting that for each model procedure SystemSolver performs better with other
weighting schemes), suggest that uniform weighting strategies are inadequate.

The results provided by most of the models using a dimension-based weighting scheme appear to
be better. In fact, documents with a high number of citations receive a good ranking score but the
rank provided is not simply a citation count. As we can observe in Figure 4, there is no substantial
difference in the shape of the cloud of points obtained using different models. Similar results can be
observed with the double-dimension or heap weighting strategies.

Figure 4: Comparison between the rank provided by three models with a dimension-based weighting
strategy (val) and citation count (#ref).

Many authors use the precision-at-N (P@N) measure as an evaluation method. For a given
$N \in \mathbb{N}$ this measure is defined as

$$
P@N = \frac{|E_N \cap F_N|}{N},
$$

where $E_N$ are the top $N$ ranked objects according to the ranking method one has to evaluate,
and $F_N$ are the top $N$ ranked objects according to the “perfect” ranking. Of course, since the
“perfect” ranking is not available, the top objects are generally ranked manually by volunteers, or
other algorithms are taken into consideration. In our case, when ranking patents it is very hard to
find reliable volunteers because of the expertise required to find the most valuable patents in such a
large database. We used instead as comparison the rank provided by PageRank and the citation
count. Figure 5 depicts, for N = 50, 100, 200, the performance of four of our models, i.e.
Stiff-D, Static-DD, Heap-HH, SHeap-HH, with respect to citation count (thick bars) and PageRank
(thin bars). We note that our methods are more related to the rank produced by PageRank than
to the simple citation count. The similarity is higher for the SHeap model since in that case the
attributes are used in a less significant way. Surprisingly enough, the Static-DD model shares more
than 60% of the top hits with PageRank, although the two models are very different.
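
The measure is straightforward to compute once two score tables are available. The sketch below is a minimal illustration with invented scores; representing each ranking as a dictionary from item to score, and breaking ties by insertion order, are assumptions of this example.

```python
def precision_at_n(scores, reference_scores, n):
    """P@N: fraction of the top-n items of `scores` that are also in the
    top-n items of `reference_scores` (both map item -> score)."""

    def top(s):
        # Sort items by decreasing score; ties keep insertion order (assumption).
        return set(sorted(s, key=s.get, reverse=True)[:n])

    return len(top(scores) & top(reference_scores)) / n


# Hypothetical scores for four items under two rankings.
ours = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
reference = {"a": 10, "b": 1, "c": 7, "d": 0}

p_at_2 = precision_at_n(ours, reference, 2)
p_at_3 = precision_at_n(ours, reference, 3)
```

With these values the two top-2 sets share only item `a`, while the top-3 sets coincide.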



Figure 5: P@N performance of four of our
models, i.e. Stiff-D, Static-DD, Heap-HH,
SHeap-HH, with respect to citation count (for each
color the thick bars) and PageRank (for each
color the thin bars).

Figure 6: For firms, the P@N performance
of four of our models, i.e. Stiff-D,
Static-DD, Heap-HH, SHeap-HH, with respect to the
number of patents granted to each firm shows
a high correlation.


The precision measure P@N can also be used to evaluate the firms. In Figure 6 we show the
comparison with the rank induced by sorting the firms by the number of patents issued. We see
that there is a very high correlation with the number of patents issued to a given firm, up to 90%
for the Stiff-D model. The precision is lower for the Heap-HH model, where the citation matrix
is combined with those of the features, mitigating the effect of the number of patents granted
to a firm. In all the models we find very popular firms in the top positions, such as IBM, Canon,
Motorola, Philips, Sony, Bell, etc. Among the top results we also find firms such as Bell Labs and
Bayer AG which, despite having been issued a relatively low number of patents in the period 1976-2012
(2,617 and 896 respectively), appear at the top of the list.


5.2 Convergence with incomplete data 

An important problem when dealing with large collections of multivariate data is the incompleteness
of the data. To see how robust our methods are when part of the data are missing, we performed
many experiments leaving the citation matrix unaltered and varying the level of information about
the features. In particular, we construct feature matrices $\tilde F_s$ obtained by keeping each nonzero of $F_s$ with
a fixed probability $p$, that is

$$
P(\tilde F_s(i, j) = 1) = p\, F_s(i, j).
$$

Then we replace in all the models the matrices $F_s$, $s = 1, \ldots, f$, with the matrices $\tilde F_s$.
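
The sampling rule above amounts to an independent coin flip for each stored link. A minimal sketch follows, assuming a 0/1 feature matrix stored as a list of (row, column) positions; the function name, the data and the fixed seed (used only for reproducibility) are hypothetical.

```python
import random


def subsample_links(entries, p, seed=0):
    """Keep each nonzero position independently with probability p."""
    rng = random.Random(seed)  # fixed seed so that the experiment is repeatable
    return [e for e in entries if rng.random() < p]


# Hypothetical patent-feature links as (patent, feature) pairs.
links = [(0, 0), (1, 0), (2, 1), (3, 1)]

all_links = subsample_links(links, 1.0)   # p = 1 keeps every link
no_links = subsample_links(links, 0.0)    # p = 0 removes every link
some_links = subsample_links(links, 0.5, seed=1)
```

Zero entries are never turned into links, so the sampled matrix is always a sub-pattern of the original one.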


Figure 7: Comparison between the rank provided by the one-class model and the Stiff-U model
for different values of the probability p.

The experiments performed have two different purposes. First, we would like to test if there are 
models for which the rank obtained decreasing the number of nonzero in the feature matrices does 
not converge to the one obtained with the one-class model. In fact a good model should exhibit a 
smooth convergence to the one-class model as p goes to zero. Second, we are interested to see if 
some of the models are predictive, in the sense that the rank obtained with missing data is “close 
enough” to the rank obtained using the full data, suggesting a good behavior when the data are 
partial or missing. 

We report some plots obtained for values of p equal to 1, 0.5 and 0.1. For p = 0.1 only 10% of
the attributes are present, so the ranking obtained should be very similar to the one obtained using
only citations. Plotting the ranking values versus the rank obtained with the one-class model, we
see that Uniform weighting schemes behave very poorly, since there is no convergence (see Figure 7).
This fact confirms the observation in the previous section about the inadequacy of Uniform
weighting strategies. On the contrary, with the other weighting schemes all models exhibit good
convergence, showing robustness to missing data. Figures 8 and 9 depict the results for
the three values of p for the Static-DD and the Heap-H models.

To better understand the effectiveness of the proposed methods when links are missing, we can
compare the rank provided with all the links with that obtained using a small percentage of the links
of the features. Figures 10 and 11 depict the comparison between the rank of the patents
for dataset DS1 and the rank of the patents using only 10% or 50% of the links of the features
Figure 8: Comparison between the rank provided by the one-class model and the Static-DD model
for different values of the probability p.



Figure 9: Comparison between the rank provided by the one-class model and the Heap-H model for 
different values of the probability p. 


N      p=0.1    p=0.5
50     62%      66%
100    74%      77%
200    73%      77%

Table 6: Measure of intersection between the top N patents ranked using the Static-DD model and
the rank obtained with the same model keeping each edge of the attributes with probability 0.1 or
0.5.


for the Heap-H model. We see that the ranks obtained with partial information are not the same as
those provided using the full matrix, but the cloud has a reasonable shape, showing the good
predictive properties of these models for missing data. Moreover, for a lower percentage of missing
links, the cloud is located in a thinner region around the diagonal.

For the Static-DD model, and for the first N positions in the ranked list, we measure the intersection
between the rank provided with the full data and the one obtained with only 10% or 50% of the
links of the attributes. The results in Table 6 show that the ranks of the patents are very similar:
among the top 100 patents, 77 are still in the top positions even after removing 50% of
the links of the features, meaning that the most interesting patents appear in the top positions also
with incomplete data.
Figure 10: Comparison between the rank provided
by the Heap-H model for the patents
and the same model using only 10% of the
links.

Figure 11: Comparison between the rank provided
by the Heap-H model for the patents
and the same model using only 50% of the
links.


Patent number    Technology    Subclass
6895624          15            111 Brush and scraper
6895625          15            28 Rotary disk
6895626          15            50.1 Scrubber
6895627          15            98 Floor and wall cleaner

Table 7: Four patents in the class 15 - BRUSHING, SCRUBBING, AND GENERAL CLEANING,
with different subclasses.


5.3 Consistence for class aggregation 

For some problems it is possible to tune the granularity of the subdivision in classes. For example, 
in our databases of patents we can decide how to group the technologies (in classes or subclasses) 
or geographical areas (regions or nations) and for scientific publications we can classify papers on 
the basis of their specific subject classification (there are many subject classifications tables such 
as AMS, MSC, ACM) or use a coarser grain based on disciplines. The granularity chosen depends 
of course on what the ranking is used for, but a good ranking schema should provide compatible 
results when using different granularities. 

As an example, consider the patents in Table 7. All these patents are in the same class 15
(BRUSHING, SCRUBBING, AND GENERAL CLEANING) but have a secondary subclass as well.
The rank of the patents obtained using the extended Technologies-Patent matrix should be similar
to that obtained using a more compact Technologies-Patent matrix where, for example, the four
patents in Table 7 are all grouped under the same Technology 15.

Figure 12 shows the comparison of the patents’ ranks obtained using two different Technology-Patent
matrices. In the compact model we use only the main technology class, i.e. in the example
of Table 7 the four patents associated with different subclasses are classified as belonging to
the same class 15. In the extended model, on the contrary, we use a fatter Technology-Patent
matrix, with a row for each different subclass. We see that the rank of the patents is minimally
affected by the change. Of course the rank of Technologies changes a bit more. To compare the
rank of the main 472 technologies (compact model) we summed up the rank of all the subclasses
of each class.

Figure 12: In the first picture a comparison between the rank of the patents provided by the
Static-D model using the extended and compact Technology-Patent matrix. In the second picture
the comparison of the ranks of Technologies for the extended and compact model.

The plots obtained using models with a weighting scheme of type DD are less grouped around the
diagonal, meaning that the ranks obtained with the compact or extended technologies differ more
than using the dimension based technique.
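
The aggregation step used to compare the two granularities, summing the scores of the subclasses of each main class, can be sketched as follows. The subclass labels are those of Table 7, while the scores, the extra class "16" and the function name are invented for illustration.

```python
from collections import defaultdict


def aggregate_by_class(subclass_scores):
    """Sum the scores of all subclasses that share the same main class.

    subclass_scores maps (main_class, subclass) -> ranking score.
    """
    totals = defaultdict(float)
    for (main_class, _subclass), score in subclass_scores.items():
        totals[main_class] += score
    return dict(totals)


# Hypothetical scores from an extended (subclass-level) model.
scores = {
    ("15", "111"): 0.25,
    ("15", "28"): 0.5,
    ("15", "50.1"): 0.125,
    ("15", "98"): 0.125,
    ("16", "1"): 0.5,
}
class_scores = aggregate_by_class(scores)
```

The aggregated values can then be plotted against the scores that the compact model assigns to the same main classes.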

5.4 Considerations about the execution time 

Our ranking algorithm works offline, in the sense that the scores are precomputed and stored, as
is done for Google’s ranking algorithm. The computation of the rank can be done periodically:
for example, for patents it is reasonable to update the rank weekly, while for scientific papers a
recomputation after a month would be sufficient since most journals have monthly issues. Our
algorithms require a time ranging from 20 minutes to 2 hours to compute the ranking on the larger
dataset DS2 (where the matrix involved has size approximately 13 million) on a quad-core Intel
Xeon @2.8GHz. To search among the documents one has to add a search module and retrieve the
documents relevant to a given query. The ranking scores of the relevant documents can then simply
be pulled out of the list of all the documents sorted by ranking score.


6 Conclusion 

In this paper we propose several models for ranking multi-parameter data on the basis of the linkage
structure. We assume the citation matrix is enriched with other attributes (features) that can be
represented by multi-class models. We use the attributes to improve the ranking process and, as a
by-product, we obtain a ranking of the attributes as well. After describing the models and different
weighting strategies for measuring the influence of each feature in the ranking process, we describe
an algorithm for computing the rank based on an iterative scheme which combines non-stationary
and stationary methods. We test some of the numerical methods on two large datasets of US
patents, and we address issues such as stability and convergence of the algorithm applied to each
model, convergence with incomplete data, and consistency under class aggregation. In particular, the
experimental part on large datasets shows that these techniques can be used in real applications
where we have objects with multiple attributes and where some information can be missing due to
errors or incompleteness of the data. To search among the ordered list of objects one has to add
a search module and retrieve only the objects relevant to a given query, in analogy to what is done
in the context of web search engines.

As future research we plan to address the problem of spam by introducing mechanisms for
penalizing self-citations and spammers. A possible approach to deal with cheating could consist in
appropriately weighting citations and in modifying the main diagonal of the diagonal blocks to
mitigate the influence of spammers on the final rank.

Another challenging direction for future work is the incorporation of a preprocessing phase aimed at recovering
missing entries. Unfortunately automatic techniques such as the one proposed in m do not seem
straightforwardly applicable to our case, but an attempt to recover data based on the similarity
of data in specific domains, such as bibliographic data, may be feasible. For particular problems,
such as bibliographic ranking, static indicators for journals such as the Impact Factor or the Mathematical
Citation Quotient are available. We plan to investigate how this information can be used in our
scheme for improving the ranking process or as a starting point to reduce the number of iterations.